Abstract

Modeling the purposeful behavior of imper- 
fect agents from a small number of obser- 
vations is a challenging task. When re- 
stricted to the single-agent decision-theoretic 
setting, inverse optimal control techniques 
assume that observed behavior is an approxi- 
mately optimal solution to an unknown deci- 
sion problem. These techniques learn a util- 
ity function that explains the example behav- 
ior and can then be used to accurately predict 
or imitate future behavior in similar observed 
or unobserved situations. 

In this work, we consider similar tasks in 
competitive and cooperative multi-agent do- 
mains. Here, unlike single-agent settings, a 
player cannot myopically maximize its re- 
ward — it must speculate on how the other 
agents may act to influence the game's out- 
come. Employing the game-theoretic notion 
of regret and the principle of maximum en- 
tropy, we introduce a technique for predicting 
and generalizing behavior, as well as recover- 
ing a reward function in these domains. 



1. Introduction 

Predicting the actions of others in complex and strate- 
gic settings is an important facet of intelligence that 
guides our interactions — from walking in crowds to ne- 
gotiating multi-party deals. Recovering such behavior 
from merely a few observations is an important and 
challenging machine learning task. 

While mature computational frameworks for decision-making have been developed to prescribe the behavior that an agent should perform, such frameworks are often ill-suited for predicting the behavior that an agent will perform. Foremost, the standard assumption of decision-making frameworks that a criterion for preferring actions (e.g., costs, motivations and goals) is known a priori often does not hold. Moreover, real behavior is typically not consistently optimal or completely rational; it may be influenced by factors that are difficult to model or subject to various types of error when executed. Meanwhile, the standard tools of statistical machine learning (e.g., classification and regression) may be equally poorly matched to modeling purposeful behavior; an agent's goals often succinctly, but implicitly, encode a strategy that would require tremendous amounts of data to learn.

A natural approach to mitigate the complexity of recovering a full strategy for an agent is to consider identifying a compactly expressed utility function that rationalizes observed behavior: that is, identify rewards for which the demonstrated behavior is optimal and then leverage these rewards for future prediction. Unfortunately, the problem is fundamentally ill-posed: in general, many reward functions can make behavior seem rational, and in fact, the trivial, everywhere-zero reward function makes all behavior appear rational (Ng & Russell 2000). Further, after removing such trivial reward functions, there may be no reward function for which the demonstrated behavior is optimal, as agents may be imperfect and the real world they operate in may be only approximately represented.

In the single-agent decision-theoretic setting, inverse optimal control methods have been used to bridge this gap between the prescriptive frameworks and predictive applications (Abbeel & Ng 2004; Ratliff et al. 2006; Ziebart et al. 2008a; 2010). Successful applications include learning and prediction tasks in personalized vehicle route planning (Ziebart et al. 2008a), robotic crowd navigation (Henry et al. 2010), quadruped foot placement and grasp selection (Ratliff et al. 2009). A reward function is learned by these techniques that both explains demonstrated behavior



Computational Rationalization: The Inverse Equilibrium Problem 



and approximates the optimality criteria of prescriptive decision-theoretic frameworks.

As these methods only capture a single reward func- 
tion and do not reason about competitive or cooper- 
ative motives, inverse optimal control proves inade- 
quate for modeling the strategic interactions of mul- 
tiple agents. In this paper, we consider the game- 
theoretic concept of regret as a necessary stand-in for 
the optimality criteria of the single-agent work. As 
with the inverse optimal control problem, the result is 
fundamentally ill-posed. We address this by requiring 
that for any utility function linear in known features, 
our learned model must have no more regret than that 
of the observed behavior. We demonstrate that this 
requirement can be re-cast as a set of equivalent con- 
vex constraints that we denote the inverse correlated 
equilibrium (ICE) polytope. 

As we are interested in the effective prediction of behavior, we will use a maximum entropy criterion to select behavior from this polytope. We demonstrate that optimizing this criterion leads to mini-max optimal prediction of behavior subject to approximate rationality. We consider the dual of this problem and note that it generalizes the traditional log-linear maximum entropy family of problems (Della Pietra et al. 2002). We provide a simple and computationally efficient gradient-based optimization strategy for this family and show that only a small number of observations are required for accurate prediction and transfer of behavior. We conclude by considering a matrix routing game and compare the ICE approach to a variety of natural alternatives.

Before we formalize imitation learning in matrix 
games, motivate our assumptions and describe and an- 
alyze our approach, we will review the game-theoretic 
notions of regret and the correlated equilibrium. 

2. Game Theory Background 

Matrix games are the canonical tool of game the- 
orists for representing strategic interactions ranging 
from illustrative toy problems, such as the "Prisoner's 
Dilemma" and the "Battle of the Sexes" games, to im- 
portant negotiations, collaborations, and auctions. In 
this work, we employ a class of games with payoffs or 
utilities that are linear functions of features defined 
over the outcome space. 



Definition 1. A linearly parameterized normal-form game, or matrix game, $\Gamma = (N, A, F)$, is composed of: a finite set of players, $N$; a set of joint-actions or outcomes, $A = \times_{i \in N} A_i$, consisting of a finite set of actions for each player, $A_i$; a set of outcome features, $F = \{\theta_a^i \in \mathbb{R}^K\}$ for each outcome that induce a parameterized utility function, $u_i(a|w) = \theta_a^{i\,T} w$, the reward for player $i$ achieving outcome $a$ w.r.t. utility weights $w$.

For notational convenience, we let $a_{-i}$ denote the vector $a$ excluding component $i$ and let $A_{-i} = \times_{j \neq i, j \in N} A_j$ be the set of such vectors.

In contrast to standard normal-form games where the utility functions for game outcomes are known, in this work we assume that "true" utility weights, $w^*$, which govern observed behavior, are unknown. This allows us to model real-world scenarios where a cardinal utility is not available or is subject to personal taste.

We model the players with a distribution $\sigma \in \Delta_A$ over the game's joint-actions. Coordination between players can exist; thus, this distribution need not factor into independent strategies for each player. Conceptually, a signaling mechanism, such as a traffic light, can be thought to sample a joint-action from $\sigma$ and communicate to each player $a_i$, its portion of the joint-action. Each player can then consider deviating from $a_i$ using a modification function, $f_i : A_i \to A_i$ (Blum & Mansour 2007). The switch modification function, for instance,

    $\mathrm{switch}_i^{x \to y}(a_i) = y$ if $a_i = x$, and $a_i$ otherwise,    (1)

substitutes action $y$ for recommendation $x$.

Instantaneous regret measures how much a player would benefit from a particular modification function when the coordination device draws joint-action $a$,

    $\mathrm{regret}_i(a|f_i, w) = u_i(f_i(a_i), a_{-i}|w) - u_i(a|w)$    (2)
    $= [\theta^i_{f_i(a_i), a_{-i}} - \theta^i_a]^T w$    (3)
    $= r_a^{f_i\,T} w.$    (4)

Players do not have knowledge of the complete joint-action; thus, each must reason about the expected regret with respect to a modification function,

    $E_{a \sim \sigma}[\mathrm{regret}_i(a|f_i, w)] = \sum_{a \in A} \sigma_a r_a^{f_i\,T} w$    (5)
    $= \sigma^T R^{f_i} w,$    (6)

where $R^{f_i}$ is the regret matrix whose rows are the vectors $r_a^{f_i\,T}$.



It is helpful to consider regret with respect to a class of modification functions. Two classes are particularly important for our discussion. First, internal regret corresponds to the set of modification functions where a single action is replaced by a new action, $\Phi_i^{int} = \{\mathrm{switch}_i^{x \to y}(\cdot) : \forall x, y \in A_i\}$. Second, swap regret corresponds to the set of all modification functions, $\Phi_i^{swap} = \{f_i\}$. We denote $\Phi = \cup_{i \in N} \Phi_i$.

The expected regret with respect to $\Phi$ and outcome distribution $\sigma$,

    $R^{\Phi}(\sigma, w) = \max_{f_i \in \Phi} E_{a \sim \sigma}[\mathrm{regret}_i(a|f_i, w)],$    (7)

is important for understanding the incentive to deviate from, and hence the stability of, the specified behavior. The most general modification class, $\Phi^{swap}$, leads to the notion of $\epsilon$-correlated equilibrium (Osborne & Rubinstein 1994), in which $\sigma$ satisfies $R^{\Phi^{swap}}(\sigma, w^*) \le \epsilon$. Thus, regret can be thought of as a substitute for utility when assessing the optimality of behavior in multi-agent settings.
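To make the regret definitions concrete, the following is a minimal sketch that evaluates the expected regret over switch modification functions in a tiny two-player game. The game itself (a "stop"/"go" intersection with two features per outcome), the feature values, and all function names are illustrative assumptions and are not taken from the paper's experiments.

```python
from itertools import product

# Hypothetical 2-player, 2-action intersection game with K = 2 features
# per outcome: (time spent waiting, crash indicator). Illustrative only.
ACTIONS = ["stop", "go"]

def features(player, outcome):
    """Feature vector theta for `player` at joint-action `outcome`."""
    mine, other = outcome[player], outcome[1 - player]
    waiting = 1.0 if mine == "stop" else 0.0
    crash = 1.0 if mine == "go" and other == "go" else 0.0
    return (waiting, crash)

def utility(player, outcome, w):
    """Linear utility: features dotted with the utility weights w."""
    return sum(f * wk for f, wk in zip(features(player, outcome), w))

def switch(x, y):
    """Switch modification function: play y whenever x is recommended."""
    return lambda a: y if a == x else a

def instantaneous_regret(player, f, outcome, w):
    """Gain from applying f to the player's own part of the joint-action."""
    deviated = list(outcome)
    deviated[player] = f(outcome[player])
    return utility(player, tuple(deviated), w) - utility(player, outcome, w)

def expected_regret(sigma, w):
    """Max over players and switch functions of the regret expected under sigma."""
    worst = float("-inf")
    for player in (0, 1):
        for x, y in product(ACTIONS, repeat=2):
            f = switch(x, y)
            value = sum(p * instantaneous_regret(player, f, a, w)
                        for a, p in sigma.items())
            worst = max(worst, value)
    return worst

w = (-1.0, -10.0)  # assumed weights: dislike waiting, strongly dislike crashing
traffic_light = {("stop", "go"): 0.5, ("go", "stop"): 0.5}
always_go = {("go", "go"): 1.0}
print(expected_regret(traffic_light, w))
print(expected_regret(always_go, w))
```

Under these assumed weights the coordinated "traffic light" distribution leaves no player with an incentive to switch, while the distribution that always crashes has positive regret, which is the sense in which regret stands in for optimality.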

3. Imitation Learning in Matrix Games 

We are now equipped with the tools necessary to introduce our approach for imitation learning in multi-agent settings. As input, we observe a sequence of outcomes, $\{a^m\}_{m=1}^M$, sampled from $\sigma$, the true behavior. We denote the empirical distribution of this sequence, $\tilde\sigma$, the demonstrated behavior. We aim to learn a predictive behavior distribution, $\hat\sigma$, from these demonstrations. Moreover, we would like our learning procedure to extract the motives and intent for the behavior so that we may imitate the players in similarly structured, but unobserved games.
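As a small illustration of the input to the learning problem, the sketch below builds the empirical (demonstrated) distribution from a list of observed joint-actions. The observation data and the function name `empirical_distribution` are hypothetical.

```python
from collections import Counter

def empirical_distribution(observations):
    """Map each observed joint-action to its relative frequency."""
    counts = Counter(observations)
    total = len(observations)
    return {outcome: n / total for outcome, n in counts.items()}

# Hypothetical sequence of M = 4 observed joint-actions for two players.
observed = [("stop", "go"), ("go", "stop"), ("stop", "go"), ("stop", "go")]
sigma_tilde = empirical_distribution(observed)
print(sigma_tilde)
```

Outcomes that were never observed simply receive no entry (probability zero), which is one reason the raw empirical distribution is a poor predictor when observations are scarce.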

Imitation appears hard barring further assumptions. 
In particular, if the agents are unmotivated or their 
intentions are not coerced by the observed game, there 
is little hope of recovering principled behavior in a new 
game. Thus, we require some form of rationality. 

3.1. Rationality Assumptions 

We say that agents are rational under their true preferences when they are indifferent between $\hat\sigma$ and their true behavior if and only if $R^{\Phi}(\hat\sigma, w^*) \le R^{\Phi}(\sigma, w^*)$.

As agents' true preferences $w^*$ are unknown to the observer, we must consider an encompassing assumption that requires any behavior that we estimate to satisfy this property for all possible utility weights, or

    $\forall w \in \mathbb{R}^K, \; R^{\Phi}(\hat\sigma, w) \le R^{\Phi}(\sigma, w).$    (8)

Any behavior achieving this restriction, strong rationality, is also rational, and, by virtue of the contrapositive, we see that unless we have additional information regarding the agents' true preferences, we must assume this strong assumption or we risk violating rationality.



Lemma 1. If strong rationality does not hold for alternative behavior $\hat\sigma$ then there exist agent utilities such that they would prefer $\sigma$ to $\hat\sigma$.

By restricting our attention to behavior that satisfies 
strong rationality, at worst, agents acting according to 
unknown true preference w* will be indifferent between 
our predictive distribution and their true behavior. 

3.2. Inverse Correlated Equilibria 

Unfortunately, a direct translation of the strong rationality requirement into constraints on the distribution $\hat\sigma$ leads to a non-convex optimization problem as it involves products of varying utility vectors and the behavior to be estimated. Fortunately, however, we can provide an equivalent concise convex description of the constraints on $\hat\sigma$ that ensures any feasible distribution satisfies strong rationality. We denote this set of equivalent constraints as the Inverse Correlated Equilibria (ICE) polytope:

Definition 2 (ICE Polytope).

    $\hat\sigma^T R^{f_i} = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j}, \; \forall f_i \in \Phi$;
    $\eta^{f_i} \in \Delta_{\Phi}, \; \forall f_i \in \Phi$; $\hat\sigma \in \Delta_A.$    (9)

Theorem 1. A distribution, $\hat\sigma$, satisfies the constraints above for some $\eta$ if and only if it satisfies strong rationality. That is, $\forall w \in \mathbb{R}^K$, $R^{\Phi}(\hat\sigma, w) \le R^{\Phi}(\tilde\sigma, w)$ if and only if $\forall f_i \in \Phi$, $\exists \eta^{f_i} \in \Delta_{\Phi}$ such that $\hat\sigma^T R^{f_i} = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j}$.

The proof of Theorem 1 is provided in the Appendix (Waugh et al. 2011).

We note that this polytope, perhaps unsurprisingly, is similar to the polytope of correlated equilibrium itself, but here is defined in terms of the behavior we observe instead of the (unknown) reward function. Given any observed behavior $\tilde\sigma$, the constraints are feasible as the demonstrated behavior satisfies them; our goal is to choose from these behaviors without estimating a full joint-action distribution. While the ICE polytope establishes a basic requirement for estimating rational behavior, there are generally infinitely many distributions consistent with its constraints.

3.3. Principle of Maximum Entropy 

As we are interested in the problem of statistical prediction of strategic behavior, we must find a mechanism to resolve the ambiguity remaining after accounting for the rationality constraints. The principle of maximum entropy provides a principled method for choosing such a distribution. This choice leads not only to statistical guarantees on the resulting predictions, but also to efficient optimization.

The Shannon entropy of a distribution $\sigma$ is defined as $H(\sigma) = -\sum_{x \in X} \sigma_x \log \sigma_x$. The principle of maximum entropy advocates choosing the distribution with maximum entropy subject to known (linear) constraints (Jaynes 1957):

    $\sigma_{MaxEnt} = \operatorname{argmax}_{\sigma \in \Delta_X} H(\sigma)$, subject to:    (10)
    $g(\sigma) = 0$ and $h(\sigma) \le 0$.

The resulting log-linear family of distributions (e.g., logistic regression, Markov random fields, conditional random fields) is widely used within statistical machine learning. For our problem, the constraints are precisely that the distribution is in the ICE polytope, ensuring that whatever distribution is learned has no more regret than the demonstrated behavior.

Importantly, the maximum entropy distribution sub- 
ject to our constraints enjoys the following guarantee: 

Lemma 2. The maximum entropy ICE distribution minimizes over all strongly rational distributions the worst-case log-loss, $-\sum_{a \in A} \sigma_a \log \hat\sigma_a$, when $\sigma$ is chosen adversarially and subject to strong rationality.

The proof of Lemma 2 follows immediately from the result of Grünwald and Dawid (2003).
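The two quantities used in this section, Shannon entropy and log-loss, can be sketched as follows. This is only an illustration of the definitions; the example distributions and function names are assumptions. Distributions are represented as dictionaries from outcomes to probabilities.

```python
import math

def entropy(sigma):
    """Shannon entropy H(sigma) = -sum_x sigma_x log sigma_x (natural log)."""
    return -sum(p * math.log(p) for p in sigma.values() if p > 0)

def log_loss(truth, prediction):
    """Expected log-loss of `prediction` when outcomes are drawn from `truth`."""
    return -sum(p * math.log(prediction[a]) for a, p in truth.items() if p > 0)

# Hypothetical true behavior over four outcomes and two candidate predictions.
truth = {"a1": 0.7, "a2": 0.1, "a3": 0.1, "a4": 0.1}
uniform = {a: 0.25 for a in truth}

print(log_loss(truth, uniform))  # equals log(4) for any truth on 4 outcomes
print(log_loss(truth, truth))    # equals the entropy of the true behavior
```

The log-loss of any prediction is at least the entropy of the true distribution, and the uniform prediction pays exactly the log of the number of outcomes regardless of the truth; the maximum entropy choice generalizes this worst-case reasoning to the constrained set.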



In the context of multi-agent behavior, the principle of maximum entropy has been employed to obtain correlated equilibria with predictive guarantees in normal-form games when the utilities are known a priori (Ortiz et al. 2007). We will now leverage its power with our rationality assumption to select predictive distributions in games where the utilities are unknown.

3.4. Prediction of Behavior 

Let us first consider prediction of the demonstrated behavior using the principle of maximum entropy and our strong rationality condition. After, we will extend to behavior transfer and analyze the error introduced as a by-product of sampling $\tilde\sigma$ from $\sigma$.

The mathematical program that maximizes the entropy of $\hat\sigma$ under strong rationality with respect to $\tilde\sigma$,

    $\operatorname{argmax}_{\hat\sigma, \eta} H(\hat\sigma)$, subject to:    (11)
    $\hat\sigma^T R^{f_i} = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j}, \; \forall f_i \in \Phi$;
    $\eta^{f_i} \in \Delta_{\Phi}, \; \forall f_i \in \Phi$; $\hat\sigma \in \Delta_A$,

is convex with linear constraints, feasible, and bounded. That is, it is simple and can be efficiently solved in this form. Before presenting our preferred dual optimization procedure, however, let us describe an approach for behavior transfer that further illustrates the advantages of this approach over directly estimating $\sigma$.

3.5. Transfer of Behavior 

A principal justification of inverse optimal control techniques that attempt to identify behavior in terms of utility functions is the ability to consider what behavior might result if the underlying decision problem were changed while the interpretation of features into utilities remains the same (Ng & Russell 2000; Ratliff et al. 2006). This enables prediction of agent behavior in a no-regret or agnostic sense in problems such as a robot encountering novel terrain (Silver et al. 2010) as well as route recommendation for drivers traveling to unseen destinations (Ziebart et al. 2008b).



Econometricians are interested in similar situations, 
but for much different reasons. Typically, they aim 
to validate a model of market behavior from observa- 
tions of product sales. In these models, the firms as- 
sume a fixed pricing policy given known demand. The 
econometrician uses this fixed policy along with prod- 
uct features and sales data to estimate or bound both 
the consumers' utility functions as well as unknown 
production parameters, like markup and production cost (Berry et al. 1995; Nevo 2001; Yang 2009). In this line of work, the observed behavior is considered accurate to start with; it is not suitable for settings with limited observations.

Until now, we have considered the problem of identifying behavior in a single game. We note, however, that our approach enables behavior transfer to games equipped with the same features. We denote this unobserved game as $\bar\Gamma$. As with prediction, to develop a technique for behavior transfer we assume a link between regret and the agents' preferences across the known space of possible preferences. Furthermore, we assume a relation between the regrets in both games.

Property 1 (Transfer Rationality). For some constant $\kappa > 0$,

    $\forall w, \; R^{\bar\Phi}(\hat\sigma, w) \le \kappa R^{\Phi}(\tilde\sigma, w).$    (12)

Roughly, we assume that under preferences with low regret in the original game, the behavior in the unobserved game should also have low regret. By enforcing this property, if the agents are performing well with respect to their true preferences, then the transferred behavior will also be of high quality.
As we are not privileged to know $\kappa$ and this property is not guaranteed to hold, we introduce a slack variable to allow for violations of the strong rationality constraints, guaranteeing feasibility. Intuitively, the transfer-ICE polytope we now optimize over requires that for any linear reward function and for every player, the predicted behavior in a new game must have no more regret than demonstrated behavior does in the observed game using the same parametric form of reward function. The corresponding mathematical program is:



Algorithm 1 Dual MaxEnt ICE 



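    $\max_{\hat\sigma, \eta, \nu} H(\hat\sigma) - C\nu$, subject to:    (13)
    $-\nu \mathbf{1} \le \hat\sigma^T \bar R^{f_i} - \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j} \le \nu \mathbf{1}, \; \forall f_i \in \bar\Phi$;
    $\eta^{f_i} \in \Delta_{\Phi}, \; \forall f_i \in \bar\Phi$; $\hat\sigma \in \Delta_{\bar A}$; $\nu \ge 0$.

In the above formulation, $C > 0$ is a slack penalty parameter, which allows us to choose the trade-off between obeying the rationality constraints and maximizing the entropy. Additionally, we have omitted $\kappa$ above by considering it intrinsic to $R$.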

We observe that this program is almost identical to the behavior prediction program introduced above. We have simply made substitutions of the regret matrices and modification sets in the appropriate places. That is, if $\bar\Gamma = \Gamma$, we recover prediction with a slack.

Given $\hat\sigma$ and $\nu$ we can bound the violation of the strong rationality constraint for any utility vector.

Lemma 3. If $\hat\sigma$ violates the strong rationality constraints in the slack formulation by $\nu$ then for all $w$,

    $R^{\bar\Phi}(\hat\sigma, w) \le R^{\Phi}(\tilde\sigma, w) + \nu \|w\|_1.$

One could choose to institute multiple slack variables, say one for each $f_i \in \bar\Phi$, instead of a single slack across all modification functions. Our choice is motivated by the interpretation of the dual multipliers presented in the next section. There, we will also address selection of an appropriate value for $C$.

4. Duality and Efficient Optimization 

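In this section, we will derive, interpret and describe a procedure for optimizing the dual program for solving the MaxEnt ICE problem. We will see that the dual multipliers can be interpreted as utility vectors and that optimization in the dual has computational advantages. We begin by presenting the dual of the transfer program.

    $\min_{\alpha, \beta, \xi} \sum_{f_i \in \bar\Phi} \max_{f_j \in \Phi} \left[ \tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) \right] + \log Z(\alpha, \beta)$    (14)
    subject to: $\xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} \alpha_k^{f_i} + \beta_k^{f_i} = C$, $\alpha, \beta, \xi \ge 0$,

where $Z(\alpha, \beta)$ is the partition function,

    $Z(\alpha, \beta) = \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right).$

Removing the equality constraint is equivalent to disallowing any slack. We derive the dual in the appendix (Waugh et al. 2011).

    Inputs: $T$, $\gamma$, $C > 0$, $R$, $\Phi$ and $\bar\Phi$
    $\forall f_i \in \bar\Phi$: $\alpha^{f_i}, \beta^{f_i} \leftarrow 1/(|\bar\Phi| K + 1)$
    for $t$ from 1 to $T$ do
        /* compute the gradient */
        $\forall a \in \bar A$: $z_a \leftarrow \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right)$
        $Z \leftarrow \sum_{a \in \bar A} z_a$
        for $f_i \in \bar\Phi$ do
            $f^* \leftarrow \operatorname{argmax}_{f_j \in \Phi} \tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i})$
            $g^{f_i} \leftarrow \tilde\sigma^T R^{f^*} - \sum_{a \in \bar A} z_a \bar r_a^{f_i} / Z$
        end for
        /* descend and project */
        $\rho \leftarrow 1 + \sum_{f_i, k} \alpha_k^{f_i} \exp(-\gamma_t g_k^{f_i}) + \beta_k^{f_i} \exp(\gamma_t g_k^{f_i})$
        $\forall f_i \in \bar\Phi, k \in K$: $\alpha_k^{f_i} \leftarrow C \alpha_k^{f_i} \exp(-\gamma_t g_k^{f_i}) / \rho$
        $\forall f_i \in \bar\Phi, k \in K$: $\beta_k^{f_i} \leftarrow C \beta_k^{f_i} \exp(\gamma_t g_k^{f_i}) / \rho$
    end for
    return $(\alpha, \beta)$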



For $C > 0$, the dual's feasible set has non-empty interior and is bounded. Therefore, by Slater's condition, strong duality holds; there is no duality gap. In particular, we can use a dual solution to recover $\hat\sigma$.

Lemma 4. Given a dual solution, $(\alpha, \beta)$, we can recover the primal solution, $\hat\sigma$. Specifically,

    $\hat\sigma_a = \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right) / Z(\alpha, \beta).$    (15)

Intuitively, the probability of predicting an outcome is small if that outcome has high regret.
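The recovery of the primal distribution from a dual solution can be sketched as a softmax over negated regret scores. The data layout below (nested dictionaries keyed by outcome and by modification function), the toy numbers, and the name `recover_primal` are assumptions made for illustration; the paper does not prescribe a data structure.

```python
import math

def recover_primal(regret_features, alpha, beta):
    """Return sigma_a proportional to exp(-sum_f r_a^f . (alpha^f - beta^f)).

    regret_features[a][f] is the length-K regret feature vector of outcome a
    under modification function f; alpha[f] and beta[f] are length-K lists.
    """
    scores = {}
    for a, by_f in regret_features.items():
        total = 0.0
        for f, r in by_f.items():
            total += sum(rk * (ak - bk)
                         for rk, ak, bk in zip(r, alpha[f], beta[f]))
        scores[a] = -total
    top = max(scores.values())  # subtract the max for numerical stability
    weights = {a: math.exp(s - top) for a, s in scores.items()}
    z = sum(weights.values())   # partition function (up to the shift)
    return {a: wgt / z for a, wgt in weights.items()}

# Toy instance: two outcomes, one modification function "f", K = 1.
regret_features = {"a1": {"f": [2.0]}, "a2": {"f": [0.0]}}
alpha = {"f": [1.0]}
beta = {"f": [0.0]}
sigma_hat = recover_primal(regret_features, alpha, beta)
print(sigma_hat)
```

In the toy instance the outcome with the larger weighted regret receives the smaller probability, matching the intuition stated after the lemma.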

In general, the dual multipliers are utility vectors associated with each modification function in $\bar\Phi$. Under the slack formulation, there is a natural interpretation of these variables as a single utility vector. Given a dual solution, $(\alpha, \beta)$, with slack penalty $C$, we choose

    $\pi^{f_i} = \frac{1}{C} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}),$    (16)
    $\lambda^{f_i} = \frac{1}{\pi^{f_i}} (\alpha^{f_i} - \beta^{f_i})$, and    (17)
    $\hat w = \sum_{f_i \in \bar\Phi} \pi^{f_i} \lambda^{f_i}.$    (18)

That is, we can associate with each modification function a probability, $\pi^{f_i}$, and a utility vector, $\lambda^{f_i}$. Thus, a natural estimate for $w$ is the expected utility vector. Note, $\sum_{f_i \in \bar\Phi} \pi^{f_i}$ need not sum to one. The remaining mass, associated with $\xi$, is assigned to the zero utility vector.

The above observation implies that introducing a slack variable coincides with bounding the $L_1$ norm of the utility vectors under consideration by $C$. This insight suggests that we choose $C \ge \|w^*\|_1$, if possible, as smaller values of $C$ will exclude $w^*$ from the feasible set. If a bound on the $L_1$ norm is not available, we may solve the prediction problem on the observed game without slack and use $\|\hat w\|_1$ as a proxy.

The dual formulation of our program has important inherent computational advantages. First, it is an optimization over a simple set that is particularly well-suited for gradient-based optimization, a trait not shared by the primal program. Second, the number of dual variables, $2|\bar\Phi|K$, is typically much smaller than the number of primal variables, $|A| + |\Phi|^2$. Though the work per iteration is still a function of $|A|$ (to compute the partition function), these two advantages together let us scale to larger problems than if we consider optimizing the primal objective. Computing the expectations necessary to descend the dual gradient can leverage recent advances in structured, compact game representations: in particular, any graphical game with low treewidth or finite-horizon Markov game (Kakade et al. 2003) enables these computations to be performed in time that scales only polynomially in the number of decision makers or time-steps.

Algorithm 1 employs exponentiated gradient descent (Kivinen & Warmuth 1995) to find an optimal dual solution. The step size parameter, $\gamma$, is commonly taken to be $\sqrt{2 \log(|\bar\Phi| K)} / \Delta$, with $\Delta$ being the largest value in any $R^{f_i}$. With this step size, if the optimization is run for $T \ge 2 \Delta^2 \log(|\bar\Phi| K) / \epsilon^2$ iterations then the dual solution will be within $\epsilon$ of optimal. Alternatively, one can exactly measure the duality gap on each iteration and halt when the desired accuracy is achieved. This is often preferred as the lower bound on the number of iterations is conservative in practice.
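The "descend and project" step of exponentiated gradient descent can be sketched as a multiplicative update followed by a rescaling onto the set where the dual variables and the slack multiplier sum to the penalty C. This is a hedged sketch, not the authors' implementation: keeping the slack multiplier `xi` as an explicit variable, the flat list layout, and the name `eg_step` are all assumptions.

```python
import math

def eg_step(alpha, beta, xi, grad, step, C):
    """One exponentiated-gradient update for nonnegative (alpha, beta, xi).

    alpha, beta, grad are equal-length lists (one entry per modification
    function and feature). The objective's gradient is `grad` with respect
    to alpha and `-grad` with respect to beta; xi has zero gradient.
    The result is rescaled so that xi + sum(alpha) + sum(beta) == C.
    """
    new_alpha = [a * math.exp(-step * g) for a, g in zip(alpha, grad)]
    new_beta = [b * math.exp(step * g) for b, g in zip(beta, grad)]
    rho = xi + sum(new_alpha) + sum(new_beta)  # normalizing constant
    scale = C / rho
    return ([a * scale for a in new_alpha],
            [b * scale for b in new_beta],
            xi * scale)

# Toy call with two dual coordinates and an assumed gradient.
alpha, beta, xi = eg_step([1.0, 1.0], [1.0, 1.0], 1.0,
                          grad=[0.5, -0.5], step=1.0, C=5.0)
print(alpha, beta, xi)
```

Because the update is multiplicative, all variables stay strictly positive, and the projection reduces to a single division, which is what makes the dual feasible set convenient for this method.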



5. Sample Complexity 

In practice, we do not have full access to the agents' 
true behavior - if we did, prediction would be straight- 
forward and not require our estimation technique. In- 
stead, we can only approximate it through finite obser- 
vation of play. In real applications there are costs as- 
sociated with gathering these observations and, thus, 
there are inherent limitations on the quality of this 
approximation. In this section, we will analyze the 
sensitivity of our approach to these types of errors. 

First, although $|A|$ is exponential in the number of players, our technique only accesses $\tilde\sigma$ through products of the form $\tilde\sigma^T R^{f_i}$. That is, we need only approximate these products accurately, not the distribution $\sigma$. As a result, we can bound the approximation error in terms of $|\Phi|$ and $K$.

Theorem 2. With probability at least $1 - \delta$, for any $w$, by observing $M \ge \frac{2}{\epsilon^2} \log \frac{2|\Phi|K}{\delta}$ outcomes we have

    $R^{\Phi}(\tilde\sigma, w) \le R^{\Phi}(\sigma, w) + \epsilon \Delta \|w\|_1.$

The proof is an application of Hoeffding's inequality and is provided in the Appendix (Waugh et al. 2011).

As an immediate corollary, considering only the true, but unknown, reward function $w^*$:

Corollary 1. With probability at least $1 - \delta$, by sampling according to the above rule, $R^{\Phi}(\hat\sigma, w^*) \le R^{\Phi}(\sigma, w^*) + (\epsilon \Delta + \nu) \|w^*\|_1$ for $\hat\sigma$ with slack $\nu$.

That is, so long as we assume bounded utility, with high probability we need only logarithmically many samples in terms of $|\Phi|$ and $K$ to closely approximate $\sigma^T R^{f_i}$ and avoid a large violation of our rationality condition.

We note that choosing $\Phi = \Phi^{int}$ is particularly appealing, as $|\Phi^{int}| \le |N||A_i|^2$ compared to $|\Phi^{swap}| \le |N||A_i|^{|A_i|}$. As internal regret closely approximates swap regret, we do not lose much of the strategic complexity by choosing the more limited set, but we require both fewer observations and fewer computational resources.
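To give a feel for the numbers, the sketch below counts internal and swap modification functions for a game with equally sized action sets and evaluates a Hoeffding-style observation bound. The specific form of the bound used here, M >= (2/eps^2) log(2|Phi|K/delta), is an assumption consistent with a union bound over |Phi|K terms; the constants should be checked against the original theorem. The example sizes (7 players, 4 actions, 4 features) mirror the routing game described in the experiments.

```python
import math

def num_internal(num_players, num_actions):
    """Switch functions: one per ordered pair (x, y) of actions, per player."""
    return num_players * num_actions ** 2

def num_swap(num_players, num_actions):
    """All functions from a player's action set to itself, per player."""
    return num_players * num_actions ** num_actions

def observations_needed(num_mods, num_features, eps, delta):
    """Assumed Hoeffding-style bound: (2 / eps^2) * log(2 |Phi| K / delta)."""
    return math.ceil(2.0 / eps ** 2
                     * math.log(2.0 * num_mods * num_features / delta))

phi_int = num_internal(7, 4)
phi_swap = num_swap(7, 4)
m_int = observations_needed(phi_int, 4, eps=0.1, delta=0.05)
m_swap = observations_needed(phi_swap, 4, eps=0.1, delta=0.05)
print(phi_int, phi_swap, m_int, m_swap)
```

The count of modification functions enters only through a logarithm, so moving from swap to internal regret saves far more in computation (fewer dual variables) than in the worst-case number of observations.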



6. Experimental Results 

To evaluate our approach experimentally, we designed a simple routing game shown in Figure 1. Seven drivers in this game choose how to travel home during rush hour after a long day at the office. The different road segments have varying capacities, visualized by the line thickness in the figure, that make some of them more or less susceptible to congestion or to traffic accidents. Upon arrival home, each driver records the total time and distance they traveled, the gas that they used, and the amount of time they spent stopped at intersections or in traffic jams: their utility features.

Figure 1. A visualization of the routing game.

Figure 2. Prediction error (log-loss) as a function of number of observations.

In this game, each of the drivers chooses from four possible routes (solid lines in Figure 1), yielding over 16,000 possible outcomes. We obtained an $\epsilon$-social welfare maximizing correlated equilibrium for those drivers where the drivers preferred mainly to minimize their travel time, but were also slightly concerned with gas usage. The demonstrated behavior $\tilde\sigma$ was sampled from this true behavior distribution $\sigma$.

First, we evaluate the differences between the true behavior distribution $\sigma$ and the predicted behavior distribution $\hat\sigma$ trained from observed behavior sampled from $\sigma$. In Figure 2 we compare the prediction accuracy when varying the number of observations using log-loss, $-\sum_{a \in A} \sigma_a \log \hat\sigma_a$. The baseline algorithms we compare against are: a maximum likelihood estimate of the distribution over the joint-actions with a uniform prior, an exponential family distribution parameterized by the outcome's utilities trained with logistic regression, and a maximum entropy inverse optimal control approach (Ziebart et al. 2008a) trained individually for each player.

In Figure 2 we see that MaxEnt ICE predicts behavior with higher accuracy than all other algorithms when the number of observations is limited. In particular, it achieves close to its best performance with as few as 16 observations. The maximum likelihood estimator eventually overtakes it, as expected since it will ultimately converge to $\sigma$, but only after 10,000 observations, or about as many observations as there are outcomes in the game. This experiment demonstrates that learning underlying utility functions to estimate observed behavior can be much more data-efficient for small sample sizes, and additionally, that the regret-based assumptions of MaxEnt ICE are both reasonable and beneficial in our strategic routing game setting.

Table 1. Transfer error (log-loss) on unobserved games.

    Problem         Logistic Model    MaxEnt ICE
    Add Highway     4.177             3.093
    Add Driver      4.060             3.477
    Gas Shortage    3.498             3.137
    Congestion      3.345             2.965

Next, we evaluate behavior transfer from this routing game to four similar games, the results of which are displayed in Table 1. The first game, Add Highway, adds the dashed route to the game. That is, we model the city building a new highway. The second game, Add Driver, adds another driver to the game. The third game, Gas Shortage, keeps the structure of the game the same, but changes the reward function to make gas mileage more important to the drivers. The final game, Congestion, adds construction to the major roadway, delaying the drivers.

These transfer experiments even more directly demonstrate the benefits of learning utility weights rather than directly learning the joint-action distribution; direct strategy-learning approaches cannot be applied to general transfer settings. Thus, we only compare against the Logistic Model. We see from Table 1 that MaxEnt ICE outperforms the Logistic Model in all of our tests. For reference, in these new games, the uniform strategy has a loss of approximately 6.8 in all games, and the true behavior has a loss of approximately 2.7.
7. Conclusion

In this paper, we extended inverse optimal control 
to multi-agent settings by combining the principle of 
maximum entropy with the game-theoretic notion of 
regret. We observed that our formulation has a partic- 
ularly appealing dual program, which led to a simple 
gradient-based optimization procedure. Perhaps the 
most appealing quality of our technique is its theo- 
retical and practical sample complexity. In our ex- 
periments, MaxEnt ICE performed exceptionally well 
after only 0.1% of the game had been observed. 
Appendix

Rationality Properties and Primal Programs 

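The proof of Theorem 1 relies upon the following technical lemmas. Throughout, $A$ denotes both a finite set of vectors $a_i \in \mathbb{R}^K$ and the matrix whose rows are $a_i^T$.

Lemma 5.

    $b^T w \le \max_{a_i \in A} a_i^T w \iff \exists \lambda \in \Delta_A$ s.t. $b^T w \le \lambda^T A w.$

Proof of Lemma 5. Given $b^T w \le \max_{a_i \in A} a_i^T w$, choose

    $\lambda_i = 1$ if $a_i = \operatorname{argmax}_{a_j \in A} a_j^T w$, and $0$ otherwise.    (19)

Thus, $b^T w \le \max_{a_i \in A} a_i^T w = \lambda^T A w$.

Given $\exists \lambda \in \Delta_A$ s.t. $b^T w \le \lambda^T A w$,

    $b^T w \le \lambda^T A w$    (20)
    $\le \sum_i \lambda_i \max_{a_j \in A} a_j^T w$    (21)
    $= \left( \sum_i \lambda_i \right) \max_{a_i \in A} a_i^T w$    (22)
    $= \max_{a_i \in A} a_i^T w.$    (23)
    □

Lemma 6.

    $\forall w \in \mathbb{R}^K, \; b^T w \le \max_{a_i \in A} a_i^T w \iff \exists \lambda \in \Delta_A$ s.t. $b^T = \lambda^T A.$

Proof of Lemma 6.

    $\forall w \in \mathbb{R}^K, \; b^T w \le \max_{a_i \in A} a_i^T w$    (24)
    $\iff \forall w \in \mathbb{R}^K, \exists \lambda \in \Delta_A$ s.t. $b^T w \le \lambda^T A w$    (25)
    $\iff \forall w \in \mathbb{R}^K, \exists \lambda \in \Delta_A$ s.t. $[b^T - \lambda^T A] w \le 0$    (26)
    $\iff$ the following linear program has optimal value 0:

    $\max_{w, t} b^T w - t$    (27)
    subject to: $t \ge a_i^T w, \; \forall a_i \in A.$

The following linear feasibility problem is the dual of the above program:

    $\min_{\lambda} 0$    (28)
    subject to: $b^T = \lambda^T A$, $\lambda \in \Delta_A.$

By strong duality for linear programming, the primal has value 0 iff the dual is feasible, which is exactly when $\exists \lambda \in \Delta_A$ s.t. $b^T = \lambda^T A$. □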
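Proof of Theorem 1.

    $\forall w \in \mathbb{R}^K, \; R^{\Phi}(\hat\sigma, w) \le R^{\Phi}(\tilde\sigma, w)$    (29)
    $\iff \forall w \in \mathbb{R}^K, \; \max_{f_i \in \Phi} \hat\sigma^T R^{f_i} w \le \max_{f_j \in \Phi} \tilde\sigma^T R^{f_j} w$    (30)
    $\iff \forall f_i \in \Phi, \forall w \in \mathbb{R}^K, \; \hat\sigma^T R^{f_i} w \le \max_{f_j \in \Phi} \tilde\sigma^T R^{f_j} w$    (31)
    $\iff \forall f_i \in \Phi, \exists \eta^{f_i} \in \Delta_{\Phi}$ s.t. $\hat\sigma^T R^{f_i} = \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j}.$    (32)

The last step makes use of our second technical lemma. □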

Derivation of the Dual Program 

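The Lagrange dual is

    $\min_{\alpha, \beta, \gamma, \delta, \xi, u, v} \max_{\hat\sigma, \eta, \nu} \; -\sum_{a \in \bar A} \hat\sigma_a \log \hat\sigma_a - C\nu - \sum_{f_i \in \bar\Phi} \left( \hat\sigma^T \bar R^{f_i} - \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} \tilde\sigma^T R^{f_j} \right) (\alpha^{f_i} - \beta^{f_i})$    (33)
    $+ \nu \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) - \delta \left( \sum_{a \in \bar A} \hat\sigma_a - 1 \right)$    (34)
    $- \sum_{f_i \in \bar\Phi} \gamma^{f_i} \left( \sum_{f_j \in \Phi} \eta^{f_i}_{f_j} - 1 \right)$    (35)
    $+ \sum_{f_i \in \bar\Phi} \sum_{f_j \in \Phi} u^{f_i}_{f_j} \eta^{f_i}_{f_j} + \sum_{a \in \bar A} v_a \hat\sigma_a + \xi \nu$    (36)
    subject to: $\alpha, \beta, u, v, \xi \ge 0.$    (37)

To solve the unconstrained inner optimization, we take derivatives w.r.t. $\hat\sigma$, $\eta$ and $\nu$ and set equal to 0:

    $-\log \hat\sigma_a - 1 - \sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) - \delta + v_a = 0,$    (38)
    $\tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) - \gamma^{f_i} + u^{f_i}_{f_j} = 0, \; \forall f_i \in \bar\Phi, f_j \in \Phi$, and    (39)
    $-C + \xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) = 0.$    (40)

Substituting into the Lagrangian, we get

    $\min \sum_{f_i \in \bar\Phi} \gamma^{f_i} + \delta + \exp(-1 - \delta) \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) + v_a \right)$    (41)
    subject to: $\tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) - \gamma^{f_i} + u^{f_i}_{f_j} = 0, \; \forall f_i \in \bar\Phi, f_j \in \Phi$    (42)
    $\xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) = C$    (43)
    $\alpha, \beta, u, v, \xi \ge 0.$    (44)

We note that $u$ are slack variables, and that, by inspection, $v = 0$ at optimality. Thus, an equivalent program is

    $\min \sum_{f_i \in \bar\Phi} \gamma^{f_i} + \delta + \exp(-1 - \delta) \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right)$    (45)
    subject to: $\tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) \le \gamma^{f_i}, \; \forall f_i \in \bar\Phi, f_j \in \Phi$    (46)
    $\xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) = C$    (47)
    $\alpha, \beta, \xi \ge 0.$    (48)

We eliminate $\delta$ by setting its partial derivative to 0, solving for $\delta$,

    $\delta = \log\left( \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right) \right) - 1,$    (49)

and substituting back into the objective,

    $\min \sum_{f_i \in \bar\Phi} \gamma^{f_i} + \log\left( \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right) \right)$    (50)
    subject to: $\tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) \le \gamma^{f_i}, \; \forall f_i \in \bar\Phi, f_j \in \Phi$    (51)
    $\xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) = C$    (52)
    $\alpha, \beta, \xi \ge 0.$    (53)

By inspection, at optimality, $\gamma^{f_i} = \max_{f_j \in \Phi} \tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i})$. Thus an equivalent program is

    $\min \sum_{f_i \in \bar\Phi} \max_{f_j \in \Phi} \tilde\sigma^T R^{f_j} (\alpha^{f_i} - \beta^{f_i}) + \log\left( \sum_{a \in \bar A} \exp\left( -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) \right) \right)$    (54)
    subject to: $\xi + \sum_{f_i \in \bar\Phi} \sum_{k=1}^{K} (\alpha_k^{f_i} + \beta_k^{f_i}) = C$    (55)
    $\alpha, \beta, \xi \ge 0.$    (56)

Proof of Lemma 4. In the derivation of the dual program, we observed that at optimality

    $\log \hat\sigma_a = -1 - \sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) - \delta + v_a.$    (57)

Noting $v = 0$ and substituting for the optimal $\delta$, we get

    $\log \hat\sigma_a = -\sum_{f_i \in \bar\Phi} \bar r_a^{f_i\,T} (\alpha^{f_i} - \beta^{f_i}) - \log Z(\alpha, \beta).$    (58)

All that remains is to exponentiate both sides.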



□ 
Sample Complexity

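Proof of Theorem 2. Let $|\cdot|_k$ denote the absolute value of the $k$-th component.

    $P\left( \max_{f_i \in \Phi, k \in K} M \left| \tilde\sigma^T R^{f_i} - \sigma^T R^{f_i} \right|_k \ge \epsilon \Delta M \right) \le P\left( \bigcup_{f_i \in \Phi, k \in K} M \left| \tilde\sigma^T R^{f_i} - \sigma^T R^{f_i} \right|_k \ge \epsilon \Delta M \right)$    (59)
    $\le \sum_{f_i \in \Phi, k \in K} P\left( M \left| \tilde\sigma^T R^{f_i} - \sigma^T R^{f_i} \right|_k \ge \epsilon \Delta M \right)$    (60)
    $\le \sum_{f_i \in \Phi, k \in K} 2 \exp\left( -\frac{M \epsilon^2}{2} \right)$    (61)
    $= 2 |\Phi| K \exp\left( -\frac{M \epsilon^2}{2} \right)$    (62)
    $\le \delta.$    (63)

We use the union bound in step 2, and Hoeffding's inequality in step 3. Solving for $M$, we get our result

    $M \ge \frac{2}{\epsilon^2} \log \frac{2 |\Phi| K}{\delta}.$    (64)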


□ 



Proof of Corollary 1. We have $\forall w$, $R^{\Phi}(\hat\sigma, w) \le R^{\Phi}(\tilde\sigma, w) + \nu \|w\|_1$, where $\nu$ depends on the choice of the slack's penalty. Thus, we have $R^{\Phi}(\hat\sigma, w^*) \le R^{\Phi}(\tilde\sigma, w^*) + \nu \|w^*\|_1 \le R^{\Phi}(\sigma, w^*) + (\epsilon \Delta + \nu) \|w^*\|_1$ with probability at least $1 - \delta$, so long as $M$ is as large as Theorem 2 deems. We can make $\nu$ as small as we like by increasing the slack penalty. □



